age weight
Min. :2.312 Min. : 5.906
1st Qu.:3.324 1st Qu.:14.085
Median :3.715 Median :18.715
Mean :3.799 Mean :18.455
3rd Qu.:4.214 3rd Qu.:22.387
Max. :5.683 Max. :31.944
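The `data` object used throughout is not shown in this excerpt. As a minimal sketch, a simulated data frame of the same shape (the names `data`, `age`, and `weight` are taken from the code below; all numeric values here are illustrative assumptions) would reproduce such a summary:

```r
# hypothetical stand-in: the real `data` is loaded earlier in the course
set.seed(42)
n <- 200
age <- runif(n, min = 2.3, max = 5.7)           # ages in years
weight <- 5 * age + rnorm(n, mean = 0, sd = 4)  # weight increasing with age, plus noise
data <- data.frame(age = age, weight = weight)
summary(data)                                   # six-number summary of each column
```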
# ggplotly scatter plot
library(ggplot2)
library(dplyr)
library(plotly)
# first the classical ggplot thing
# we store the plot in object p
p = data %>%
  ggplot(aes(x = age, y = weight)) +
  geom_point(size = 4, alpha = .3, color = "brown") +
  labs(title = "Scatter plot of weight vs age",
       x = "Age (years)",
       y = "Weight (kg)")
# we convert p to an interactive plotly object
p %>% ggplotly()
The previous steps just confirm what we already know: weight tends to increase with age.
Questions unaddressed include:
What is the proportion of animals with weight exceeding 32 kg in the real world?
What is the growth rate (in kg per year)?
What is the strength of the association between age and weight?
Can we predict weight from age?
What is the uncertainty associated with such predictions?
Answering those questions requires going beyond descriptive statistics and data visualization, and fitting a statistical model to the data.
2 What is a statistical model?
2.1 Definition
A statistical model is a set of (mathematical/probabilistic) assumptions that scientists formulate to represent the real-world process that generates data.
It describes the relationships between different variables in the data and allows us to make inferences (guesses about the data-generating process) and predictions (guesses about unobserved values) based on the data and those relationships.
2.2 Example: animal weight as a normally distributed variable
If we make the assumption that the weight of animals in the population follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), we can use the data to estimate \(\mu\) and \(\sigma\).
# estimate mu and sigma from the data
mu_hat = mean(data$weight)
sigma_hat = sd(data$weight)
mu_hat
[1] 18.45524
sigma_hat
[1] 5.797247
# ggplot the empirical histogram (probability rather than count on the y axis)
# + plot of estimated normal distribution
data %>%
  ggplot(aes(x = weight)) +
  geom_histogram(aes(y = ..density..), bins = 20, fill = "brown", alpha = .3) +
  stat_function(fun = dnorm, args = list(mean = mu_hat, sd = sigma_hat),
                color = "red3", linewidth = 1) +
  labs(title = "Histogram of weight with estimated normal distribution",
       x = "Weight (kg)", y = "Density")
The model assumption that weight follows a normal distribution is a strong one, and it may not be entirely accurate (for example, it implicitly assigns a positive probability to negative weights).
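One way to see this: under the fitted normal distribution, a strictly positive probability is assigned to negative weights (using the values of `mu_hat` and `sigma_hat` reported above):

```r
# probability of a negative weight under the fitted normal model
mu_hat <- 18.45524    # estimate reported above
sigma_hat <- 5.797247
pnorm(0, mean = mu_hat, sd = sigma_hat)  # tiny, but strictly positive
```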
Formulating a statistical model is an acknowledgment that data values are contingent; in other words, they could have been different (different sampling units, different measurement errors, different missingness due to incomplete reporting, etc.).
A statistical model provides a useful framework for understanding the data and assessing how much trust we should place in the data at hand and the conclusions we derive from it.
For example, suppose we want to estimate the proportion of animals above 35 kg. A straightforward (but questionable) answer could be 0%, since such animals have not been observed in the data at hand.
Using our estimated normal distribution, the answer would be different: in probability theory, this can be computed as \(1 - P(X \leq 35) = 1 - F(35)\) where F is the cumulative distribution function (CDF) of the normal distribution with mean \(\hat{\mu}\) and standard deviation \(\hat{\sigma}\).
In R, we evaluate this number, denoted \(\hat{p}\), as:
# proportion of animals above 35kg
p_hat = 1 - pnorm(35, mean = mu_hat, sd = sigma_hat)
signif(p_hat * 100, digits = 2)
[1] 0.22
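For contrast, the purely empirical estimate can be computed directly; with the real data it returns 0, since no observed weight exceeds 35 kg. The simulated data frame below is a hypothetical stand-in for the course's `data`, which is not shown in this excerpt:

```r
# hypothetical stand-in for the course's `data`
set.seed(1)
data <- data.frame(weight = rnorm(200, mean = 18.46, sd = 5.80))
# empirical proportion of animals above 35 kg, in percent
mean(data$weight > 35) * 100
```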
3 General statistical principles
3.1 Data contingency and inference
Keep in mind that conclusions drawn from your data may not hold for the whole population the data were sampled from.
Extrapolating a conclusion from a sample to the whole population the data were sampled from is called the problem of statistical inference. This problem is ubiquitous in data science. It has been somewhat overlooked in recent years with the rise of machine learning, which focuses on prediction accuracy rather than on understanding the data-generating process. However, statistical inference remains crucial for drawing valid conclusions from data.
The problem of statistical inference is somewhat immaterial, or of lesser importance, when the data at hand represent the whole population and measurement errors are of small magnitude. This is often the case in Official Statistics, when statistical units are countries and data from all countries are available and obtained from reliable data collection systems.
When conclusions are extrapolated from a sample to a wider population, there will always be some uncertainty associated with those conclusions. Quantifying that uncertainty is a key aspect of statistical thinking.
The uncertainty is captured by a probability. The statistical toolbox offers multiple tools to assess the uncertainty/significance of a conclusion.
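As an illustration, one such tool is the confidence interval. A minimal sketch, assuming a data frame `data` with a `weight` column as used earlier (simulated here as a stand-in, since the real data is not shown in this excerpt):

```r
# stand-in data: the real `data` is loaded earlier in the document
set.seed(7)
data <- data.frame(weight = rnorm(100, mean = 18.46, sd = 5.80))
# 95% confidence interval for the mean weight: a range of values that
# quantifies the uncertainty of the sample mean as an estimate
t.test(data$weight, conf.level = 0.95)$conf.int
```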
For this assessment to be valid, the data collection and the evidence generation must be done in a rigorous way. This is where study design and statistical analysis plans come into play.
3.2 Key considerations in designing a statistical study
Clear objectives: define the research question(s) precisely.
Target population & sampling: specify who/what is studied and ensure representativeness.
Study design choice: select appropriate design (e.g., experiment, cohort, survey) to minimize bias.
Control of confounding & bias: use randomization, blinding, stratification where relevant.
Sample size & power: plan enough observations for reliable inference.
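The sample size and power point can be made concrete with base R's `power.t.test()`. In this sketch, the target difference (3 kg) is an illustrative assumption, and the standard deviation (5.8 kg) is borrowed from the weight data above:

```r
# sample size per group needed to detect a 3 kg difference in mean weight
# between two groups, with 80% power at the 5% significance level
power.t.test(delta = 3, sd = 5.8, sig.level = 0.05, power = 0.8)
```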
3.3 Key considerations in designing a statistical analysis